%\documentclass[10]{article}      % Specifies the document class
%\documentclass[20]{extarticle}
%\documentclass[[10pt,journal,draftcls,letterpaper,onecolumn]{IEEEtran}
%\documentclass[11pt,journal,final,letterpaper,twocolumn]{IEEEtran}
\documentclass[10pt,journal,cspaper,compsoc]{IEEEtran}
%\documentclass[10pt,conference,compsocconf]{IEEEtran}
\usepackage{graphicx}
\usepackage{algorithmic}
%\usepackage{algorithm}
\usepackage{cite}
\usepackage{amsthm}
\usepackage[cmex10]{amsmath}
\usepackage{float}
\usepackage{amsfonts}
\usepackage{multicol}
\usepackage{setspace}
%\usepackage[caption=false, ...]{subfig}
\usepackage{subfig}
%\usepackage[caption=false]{subfig}
\usepackage{multirow}
\usepackage{rotating}
\usepackage{verbatim}
\usepackage[section]{placeins}
\usepackage{url}
\usepackage{wrapfig}

\newtheorem{thm}{Theorem}
\newtheorem{cor}[thm]{Corollary}
\newtheorem{lem}[thm]{Lemma}
\newtheorem*{claim}{Claim}
\newtheorem{observation}[thm]{Observation}

%\linespread{1.1}

\newcommand{\ip}[2]{(#1, #2)}

\newcommand{\norm}[1]{\lVert#1\rVert}

%\title{Examining the sublineage structure of \emph{Mycobacterium tuberculosis} complex strains with multiple-biomarker tensors}
\title{A clustering framework for \emph{Mycobacterium tuberculosis} complex strains using multiple-biomarker tensors}
%Evolution of \emph{Mycobacterium tuberculosis} spoligotypes}  % Declares the document's title.

%\date{}
%\author{Cagri Ozcaglar}

\author{\IEEEauthorblockN{Cagri~Ozcaglar\tiny{$^{1}$}, \large{Amina~Shabbeer}\tiny{$^{1}$}, \large{Scott~Vandenberg}\tiny{$^{3}$}, \large{B\"{u}lent~Yener}\tiny{$^{1}$}, \large{Kristin~P.~Bennett}\tiny{$^{1,2}$} \\}
\normalsize
(1) Computer Science Department and (2) Mathematical Sciences Department, Rensselaer Polytechnic Institute\\
(3) Computer Science Department, Siena College \\
ozcagc2@cs.rpi.edu, shabba@cs.rpi.edu, vandenberg@siena.edu, yener@cs.rpi.edu, bennek@rpi.edu \\
}

\begin{document}
\maketitle

\section{Motivation and Goal}
Tuberculosis (TB), a leading cause of death worldwide, is a bacterial disease caused by the \emph{Mycobacterium tuberculosis} complex (MTBC). Genotyping is used to classify MTBC strains into distinct lineages and/or sublineages, which are useful for TB tracking and control and for examining host-pathogen relationships \cite{VariableHostPathogen}. The major lineages of MTBC are \emph{M. africanum}, \emph{M. canettii}, \emph{M. microti}, \emph{M. bovis}, \emph{M. tuberculosis} subgroup Indo-Oceanic, \emph{M. tuberculosis} subgroup Euro-American, \emph{M. tuberculosis} subgroup East Asian (Beijing) and \emph{M. tuberculosis} subgroup East-African Indian (CAS). While sublineages of MTBC are routinely used in the TB literature, their exact definitions and names have not been clearly established. The SpolDB4 database contains 39,295 strains and their spoligotypes, the vast majority of which are labeled and classified into 62 sublineages \cite{SpolDB4}, but many of these are considered to be ``potentially phylogeographically-specific MTBC genotype families''. Therefore, further analysis is needed to confirm these sublineages.

In this study, we develop a tensor clustering framework for sublineage classification of MTBC strains labeled by major lineage. The framework is based on two biomarkers, spoligotype and MIRU, which are typically collected for TB surveillance. We generate multiple-biomarker tensors of MTBC strains and apply multiway models for dimensionality reduction. The model accurately captures spoligotype evolutionary dynamics by using contiguous deletions of spacers. The tensor transforms spoligotypes and MIRU patterns into a new representation to which traditional clustering methods apply (we use a modified k-means clustering), without the user having to decide \emph{a priori} how to combine spoligotype and MIRU patterns. Strains are clustered based on the transformed data without using any information from SpolDB4 families. Clustering results lead to the subdivision of major lineages of MTBC into groups with clear and distinguishable spoligotype and MIRU signatures. Comparison of the clusters with SpolDB4 families suggests dividing and merging some SpolDB4 families, while strongly validating others.

\section{Methods}

Clustering MTBC strains using multiple biomarkers proceeds in a sequence of steps, outlined in Figure \ref{ClusterAnalysisProcess}. First, we generate a multiple-biomarker tensor with one mode representing the strains to be clustered and two other modes representing the two biomarkers. The generation of the multiple-biomarker tensor is shown in Figure \ref{MultipleBiomarkerTensor}. Second, we apply the multiway models PARAFAC and Tucker3 to the tensor and obtain a score matrix of strains from the strain mode. Third, we use this score matrix to measure similarity between strains and cluster them using a stable version of k-means, \texttt{kmeans\_mtimes\_seeded}. In the final step, we evaluate the clustering results using cluster validity indices.
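
The first two steps can be written out as follows. This is a sketch in assumed notation: the symbols $x_{ijk}$, $s_{ij}$, $m_{ik}$, $a_{ir}$, $b_{jr}$, $c_{kr}$, $g_{pqr}$ and the component counts $P$, $Q$, $R$ are introduced here for illustration, and the model forms are the standard PARAFAC and Tucker3 definitions rather than implementation details of this study. Let $\vec{s}_i$ be the spoligotype deletion vector and $\vec{m}_i$ the MIRU pattern vector of strain $i$. The slice of the tensor belonging to strain $i$ is the biomarker kernel matrix $\vec{s}_i \otimes \vec{m}_i$, so that
\begin{equation}
x_{ijk} = s_{ij}\, m_{ik}.
\end{equation}
PARAFAC with $R$ components approximates the tensor by
\begin{equation}
x_{ijk} \approx \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr},
\end{equation}
while Tucker3 with $(P, Q, R)$ components adds a core array $g_{pqr}$:
\begin{equation}
x_{ijk} \approx \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} g_{pqr}\, a_{ip}\, b_{jq}\, c_{kr}.
\end{equation}
In both models, row $i$ of the strain-mode matrix $A$ holds the scores of strain $i$, and these rows are the points passed to the clustering step.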

\begin{figure}[h]
%\begin{wrapfigure}{l}{0.40mm}
\begin{center}
%\centering
\includegraphics[width=1.7in]{clusteringframework.eps}
%\includegraphics[width=2in]{FlowChart.eps}
%\includegraphics[width=4in]{FlowChart-extended.eps}
\end{center}
\vspace{-5pt}
\caption{Clustering framework of MTBC strains.}
\vspace{-10pt}
\label{ClusterAnalysisProcess}
%\end{wrapfigure}
\end{figure}

%\begin{wrapfigure}{l}{0.3\textwidth}
\begin{figure}[h]
%\centering
\includegraphics[width=3.0in]{MultipleBiomarkerTensor.eps}
%\includegraphics[width=2in]{FlowChart.eps}
%\includegraphics[width=4in]{FlowChart-extended.eps}
\caption{The biomarker kernel matrices $\vec{s} \otimes \vec{m}$, one per strain, form the multiple-biomarker tensor. $\vec{s}$ represents spoligotype deletions and $\vec{m}$ represents MIRU patterns.}
\vspace{-10pt}
\label{MultipleBiomarkerTensor}
\end{figure}
%\end{wrapfigure}



%\begin{figure}[h]
%\centering
%\includegraphics[width=3.0in]{tensorfigure.eps}
%%\includegraphics[width=2in]{FlowChart.eps}
%%\includegraphics[width=4in]{FlowChart-extended.eps}
%\caption{\emph{Strain} $\times$ \emph{spoligotype deletion} $\times$ \emph{MIRU pattern} tensor. Each entry $X(i,j,k)$ of the tensor represents the number of repeats in MIRU pattern $k$ of strain $i$ with spoligotype deletion $j$.}
%\label{TensorFigure}
%\end{figure}

\begin{table*}[]
\centering
\begin{tabular}{|c||c|c|c|c|}
\cline{1-5}
Major Lineage & \# SpolDB4 families & \# Tensor sublineages & F-measure & Average best-match stability \\ \hline \hline
\emph{M. africanum} & 4 & 4 & 0.66 & 1 \\ \cline{1-5}
\emph{M. bovis} & 5 & 3 & 0.71 & 1 \\ \cline{1-5}
East Asian (Beijing) & 2 & 5 & 0.87 & 1 \\ \cline{1-5}
East-African Indian (CAS) & 4 & 3 & 0.82 & 1 \\ \cline{1-5}
Indo-Oceanic & 13 & 11 & 0.57 & 0.90 \\ \cline{1-5}
Euro-American & 33 & 33 & 0.61 & 0.85 \\ \cline{1-5}
\end{tabular}
\caption{Number of SpolDB4 families and number of tensor sublineages for each major lineage. The F-measure assesses the agreement of the tensor sublineages with the SpolDB4 families, and the average best-match stability assesses the certainty of the tensor sublineages.}
\vspace{-15pt}
\label{SummaryOfResults}
\end{table*}

\section{Results}
We subdivide each of the major lineages of MTBC into sublineages using multiple-biomarker tensors. Overall results for six major lineages are shown in Table \ref{SummaryOfResults}. The F-measure values range from 0.57 to 0.87, indicating that the sublineages found by the tensor only partially overlap with those of SpolDB4. The four sublineages of \emph{M. africanum} strains found by the tensor are quite distinct, as shown by the clear separation of the four sublineages in the PCA plot and by their biomarker signatures in Figure \ref{Mafricanum_MarkerSignature}.
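
For reference, one common form of the clustering F-measure is sketched below. The variant used to produce Table \ref{SummaryOfResults} is not stated here, so this form and the symbols $S_i$, $T_j$, $n_{ij}$ and $n$ are assumptions for illustration. Let $S_i$ be the set of strains in SpolDB4 family $i$, $T_j$ the set of strains in tensor sublineage $j$, $n_{ij} = |S_i \cap T_j|$, and $n$ the total number of strains in the major lineage. Precision, recall and their harmonic mean for the pair $(i,j)$ are
\begin{equation}
P_{ij} = \frac{n_{ij}}{|T_j|}, \quad R_{ij} = \frac{n_{ij}}{|S_i|}, \quad F_{ij} = \frac{2\, P_{ij} R_{ij}}{P_{ij} + R_{ij}},
\end{equation}
with $F_{ij} = 0$ when $n_{ij} = 0$, and the overall value weights the best match of each family by its size:
\begin{equation}
F = \sum_{i} \frac{|S_i|}{n} \max_{j} F_{ij}.
\end{equation}
Under this definition, $F = 1$ only when every SpolDB4 family coincides exactly with one tensor sublineage, so intermediate values correspond to partial overlap.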

%\begin{table}[]
%\centering
%\begin{tabular}{|c||c|c|c|c|}
%\hline
%%                 & \scriptsize{Sublineage 1} & \scriptsize{Sublineage 2} & \scriptsize{Sublineage 3} & \scriptsize{Sublineage 4} \\ \hline \hline
%                 & MA1 & MA2 & MA3 & MA4 \\ \hline \hline
%Stability        & 1         & 1         & 1         & 1 \\ \hline \hline
%AFRI             & 2	 & 1	& 5	 & 0 \\ \hline
%AFRI\_1          & 21	 & 0	& 0	 & 16 \\ \hline
%AFRI\_2          & 0	 & 0	& 12 & 0 \\ \hline
%AFRI\_3          & 0	 & 6	& 1	 & 0 \\ \hline
%\end{tabular}
%\caption{Confusion matrix for 64 distinct \emph{M. africanum} strains showing the correspondence between the SpolDB4 families and tensor sublineages. The stability of each of the tensor sublineages is given in the second row.}
%\label{Mafricanum_ConfusionMatrix}
%\end{table}

\section{Conclusion}
We developed a clustering framework that groups MTBC strains based on their spoligotype and MIRU information via multiple-biomarker tensors. Simultaneous analysis of spoligotype and MIRU through multiple-biomarker tensors, followed by clustering of MTBC strains, leads to coherent sublineages of the major lineages with clear and distinctive spoligotype and MIRU signatures. The clustering framework used in this study can be further extended to find subgroups of MTBC strains based on other biomarkers such as RFLP and SNPs. We can extend multiple-biomarker tensors by adding a new mode for each biomarker added to the genotype representation of strains, e.g., RFLP. This would be a major advance because there is no way to define a similarity measure between RFLP patterns of strains other than determining whether or not the patterns match exactly. Future work will involve using various biomarkers to group MTBC strains. Multiple-biomarker tensors with spoligotype, MIRU patterns, and RFLP as modes may lead to a clustering of MTBC strains that is comparable with lineages identified on the basis of SNPs. Since many subfamilies are clearly known and more biomarkers are being developed, the multiple-biomarker tensor can also be used in supervised classification to build reliable classifiers of MTBC sublineages and to enhance TB control, epidemiology and research.

\begin{figure}[]
\centering
%\includegraphics[width=2.7in]{Results/Method-2/Mafricanum/MafricanumResultsCondensed.eps}
\includegraphics[width=3.4in]{Result5New.eps}
%\vspace{-5pt}
\caption{PCA plot of the clustering, with spoligotype and MIRU signatures of the tensor sublineages, for the \emph{M. africanum} strain dataset.}
\vspace{-15pt}
\label{Mafricanum_MarkerSignature}
\end{figure}


\bibliographystyle{IEEEtran}
\bibliography{ComputationalClusterValidationBib}


\end{document} 